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Abstract 

In this paper, we consider estimating sparse inverse covariance of a Gaussian graphi- 
cal model whose conditional independence is assumed to be partially known. Similarly as 
in [5], we formulate it as an /i-norm penalized maximum likelihood estimation problem. 
Further, we propose an algorithm framework, and develop two first-order methods, that 
is, the adaptive spectral projected gradient (ASPG) method and the adaptive Nesterov's 
smooth (ANS) method, for solving this estimation problem. Finally, we compare the 
performance of these two methods on a set of randomly generated instances. Our com- 
putational results demonstrate that both methods are able to solve problems of size at 
least a thousand and number of constraints of nearly a half million within a reasonable 
amount of time, and the ASPG method generally outperforms the ANS method. 

Key words: Sparse inverse covariance selection, adaptive spectral projected gradient 
method, adaptive Nesterov's smooth method 
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1 Introduction 

It is well-known that sparse undirected graphical models are capable of describing and explain- 
ing the relationships among a set of variables. Given a set of random variables with Gaussian 
distribution, the estimation of such models involves finding the pattern of zeros in the in- 
verse covariance matrix since these zeros correspond to conditional independencies among the 
variables. In recent years, a variety of approaches have been proposed for estimating sparse 
inverse covariance matrix. (All notations used below are defined in Subsection II. 1[ ) Given a 
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sample covariance matrix S G 5", d'Aspremont et al. [5] formulated sparse inverse covariance 
selection as the following /i-norm penalized maximum likelihood estimation problem: 

max {logdetX - -pe^|X|e : X ^ 0}, (1) 

where p > is a parameter controlling the trade-off between likelihood and sparsity of the so- 
lution. They also studied Nesterov's smooth approximation scheme [IDj and block-coordinate 
descent (BCD) method for solving ([1]). Independently, Yuan and Lin [13] proposed a similar 
estimation problem to ([1]) as follows: 

max {logdetX- -pV|Xy| : X ^ 0}. (2) 

They showed that problem ([2]) can be suitably solved by the interior point algorithm developed 
in Vandenberghe et al. [12] . As demonstrated in [5l |T3] , the estimation problems ([1]) and ([2]) 
are capable of discovering effectively the sparse structure, or equivalently, the conditional 
independence in the underlying graphical model. Recently, Lu [8J proposed a variant of 
Nesterov's smooth method [10] for problems ([1]) and ([2]) that substantially outperforms the 
existing methods in the literature. In addition, Dahl et al. studied the maximum likelihood 
estimation of a Gaussian graphical model whose conditional independence is known, which 
can be formulated as 

max {logdetX - : X t 0, = 0,V(i,j) G E}, (3) 

where £' is a collection of all pairs of conditional independent nodes. They showed that 
when the underlying graph is nearly-chordal, Newton's method and preconditioned conjugate 
gradient method can be efficiently applied to solve ([3]). 

In practice, the sparsity structure of a Gaussian graphical model is often partially known 
from some knowledge of its random variables. In this paper we consider estimating sparse 
inverse covariance of a Gaussian graphical model whose conditional independence is assumed 
to be partially known in advance (but it can be completely unknown). Given a sample co- 
variance matrix S G 5", we can naturally formulate it as the following constrained /i-norm 
penalized maximum likelihood estimation problem: 

max logdetX — (S,X) — ^ pij\Xij\, 

^ (4) 
s.t. X yo, Xij = o,v(i, j) G n, 

where Q consists of a set of pairs of conditionally independent nodes, and {pij}(ij)i^n is a 
set of nonnegative parameters controlling the trade-off between likelihood and sparsity of the 
solution. It is worth mentioning that unlike in [3], we do not assume any specific structure 
on the sparsity of underlying graph for problem (jlj). We can clearly observe that (i) {i, i) ^ Q 
for 1 < i < n, and G f2 if and only if [j, i) G VL\ (ii) pij = pji for any (^, j) ^ fi; and (iii) 
problems ([I])- ([3]) can be viewed as special cases of problem (jl]) by choosing appropriate Q and 
{Pij}(i,j)t^- -^o^ example, if setting f2 = and pij = p for all problem (j4]) becomes ([1]). 
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It is easy to observe that problem (jlj) can be reformulated as a constrained smooth con- 
vex problem that has an explicit (9(n^)-logarithmically homogeneous self-concordant barrier 
function. Thus, it can be suitably solved by interior point (IP) methods (see Nesterov and 
Nemirovski [TT] and Vandenberghe et al. [12j). The worst-case iteration complexity of IP 
methods for finding an e-optimal solution to (jll) is (9(n log(eo/e)), where eo is an initial gap. 
Each iterate of IP methods requires 0{n^) arithmetic cost for assembling and solving a typ- 
ically dense Newton system with 0{n^) variables. Thus, the total worst-case arithmetic cost 
of IP methods for finding an e-optimal solution to (jl]) is (9(?t,^ log(eo/e)), which is prohibitive 
when n is relatively large. 

Recently, Friedman et al. |6j proposed a gradient type method for solving problem (j4]). 
They first converted (jl]) into the following penalization problem 



by setting pij to an extraordinary large number (say, 10^) for all G fi. Then they applied 
a slight variant of the BCD method ^ to the dual problem of ([5]) in which each iteration 
is solved by a coordinate descent approach to a lasso (/i-regularized) least-squares problem. 
Given that their method is a gradient type method and the dual problem of ([5]) is highly 
ill-conditioned for the above choice of p, it is not surprising that their method converges 
extremely slowly. Moreover, since the associated lasso least-squares problems can only be 
solved inexactly, their method often fails to converge even for a small problem. 

In this paper, we propose adaptive first-order methods for problem (j4j). Instead of solving 
([5]) once with a set of huge penalty parameters {Pij}(i,j)<^n, our methods consist of solving 
a sequence of problems ([5]) with a set of moderate penalty parameters {pij}(ij)^n that are 
adaptively adjusted until a desired approximate solution is found. For a given p, problem 
([5]) is solved by the adaptive spectral projected gradient (ASPG) method and the adaptive 
Nesterov's smooth (ANS) method that are proposed in this paper. 

The rest of paper is organized as follows. In Subsection ll.il we introduce the notations used 
in this paper. In Section [21 we propose an algorithm framework and develop two first-order 
methods, that is, the ASPG and ANS methods, for solving problem (jl]). The performance 
of these two methods are compared on a set of randomly generated instances in Section [31 
Finally, we present some concluding remarks in Section [H 

1.1 Notation 

In this paper, all vector spaces are assumed to be finite dimensional. The symbols 9ft", 3ft" 
and 3ft"_|_ denote the n-dimensional Euclidean space, the nonnegative orthant of 3ft" and the 
positive orthant of 3ft"', respectively. The set of all m x n matrices with real entries is denoted 
by 3ft"*^". The space of symmetric nxn matrices will be denoted by 5". If X G 5" is positive 
semidefinite, we write X ^ 0. Also, we write X ^ Y to mean Y — X y 0. The cone of 
positive semidefinite (resp., definite) matrices is denoted by 5" (resp., 5"^). Given matrices 
X and Y in Sft™'^", the standard inner product is defined by (X, Y) := Tr(Xy-^), where Tr(-) 




(5) 
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denotes the trace of a matrix. || • || denotes the Euchdean norm and its associated operator 
norm unless it is exphcitly stated otherwise. The Frobenius norm of a real matrix X is defined 
as II^IIf := \/Tr(XX^). We denote by e the vector of all ones, and by / the identity matrix. 
Their dimensions should be clear from the context. For a real matrix X, we denote by |X| 
the absolute value of X, that is, \X\ij = \Xij\ for all The determinant and the minimal 
(resp., maximal) eigenvalue of a real symmetric matrix X are denoted by detX and Amin(X) 
(resp., Ainax(-^))5 respectively, and Xi{X) denotes its ith largest eigenvalue. Given an n x n 
(partial) matrix p, Diag(p) denotes the diagonal matrix whose ith diagonal element is pa for 

1 = 1, . . . ,n. Given matrices X and Y in 3??™-^", X *Y denotes the pointwise product of X 
and Y, namely, X *Y & gfjmxn -y^^j^ose ijth entry is XijYij for all We denote by the 
set of all nonnegative integers. 

2 Adaptive first-order methods 

In this section, we discuss some suitable first-order methods for general sparse inverse covari- 
ance selection problem (jl]). In particular, we first provide an algorithm framework for it in 
Subsection 12.11 Then we specialize this framework by considering two first-order methods, 
namely, the adaptive spectral projected gradient method and the adaptive Nesterov's smooth 
method in Subsection 12. 2[ 

2.1 Algorithm framework 

In this subsection, we provide an algorithm framework for general sparse inverse covariance 
selection problem (jlj). 

Throughout this paper, we assume that pij > is given and fixed for all ^ Q, and 
that the following condition holds. 

Assumption 1 S -|- Diag(p) y 0. 

Note that S is a sample covariance matrix, and hence S ^ 0. In addition, Diag(p) >z 0. 
Thus, S -|- Diag(p) >z 0. It may not be, however, positive definite in general. But we can 
always perturb pa by adding a small positive number (say, 10~^) whenever needed to ensure 
the above assumption holds. 

We first establish the existence of an optimal solution for problem (jl]) as follows. 

Proposition 2.1 Problem has a unique optimal solution X* G 3"^+. 

Proof. Since (i, i) ^ fl for i = 1, . . . ,n, we see that X = / is a feasible solution of problem 
dl]). For convenience, let f{X) denote the objective function of (HI). We now show that the 
sup-level set Sf{I) = {X y : f{X) > f{I), Xij = 0, \/{i,j) G Q} is compact. Indeed, using 
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the definition of /(■), we observe that for any X G Sf{I), 

n 

f{I) < f{X) < logdetX-(S + Diag(p),X) < J] [log A,(X) - A,nin(S + Diag(p))A,(X)] 

i=l 

< (n-1) [-1 - log A„,in(S + Diag(p))] + log A^ax(^) - Kin{^ + Diag(p))A^ax(X), 

where the last inequality follows from the fact that for any a > 0, 

max {logt — at : t > 0} = — 1 — logo. (6) 

Hence, we obtain that for any X G Sf{I), 

logA„,ax(X)-An,in(S + Diag(p))A^ax(X) > /(/)-(r2-l)[-l-logA^in(S + Diag(p))], (7) 

which implies that there exists some /3(p) > such that Amax(-'^) < Pip) for all X G 5/(1). 
Thus, 5/(J) C {X G 5" : ^ X ^ Pip) I}- Further, using this result along with the 
definition of /(■), we easily observe that for any X G Sfil), 

n-1 

hgX^UX) = /(X)- glogA,(X) + (S,X)+ E p,,|X,,|, 

> /(/)- (n - 1) log /?(p)+ mill {(S,X)+ E P^J\X^J\}. 

It follows that there exists some a(p) > such that Amin(X) > a(p) for all X G Sfil). Hence, 
Sfil) C {X G 5" : a(p)/ ^ X ^ /3(p)/} is bounded, which together with the fact that /(■) is 
continuous in the latter set, implies that Sfil) is closed. Therefore, problem (jl]) has at least 
an optimal solution. Further, observing that /(•) is strict concave, we conclude that problem 
(jl]) has a unique optimal solution. ■ 

Similarly, we can show that the following result holds. 

Proposition 2.2 Given any pij > for ii,j) G Q, problem ^ has a unique optimal solution 
X* G 5^+. 

Before presenting an algorithm framework for problem (jlj), we introduce a terminology for 
(jlj) as follows. 

Definition 1 Let to > and ec > be given. Let /(■) and f* denote the objective function 
and the optimal value of 0), respectively. X E S!^ is an ieo, ec)- optimal solution of problem 
if fix) > f* - to and max |Xy| < e^. 

Analogously, we can define an eo-optimal solution for problem Given that our ultimate 
aim is to estimate a sparse inverse covariance matrix X* >z that satisfies at least X*j = 0, 
V(i,_7) G Q and approximately maximizes the log-likelihood, we now briefiy discuss how to 
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obtain such an approximate solution X* from an (eo, ec)-optimal solution X* of (jl]). Let us 
define X* G 5" by letting X* = X*-, V(i, j) ^ and X*- = 0, V(i, j) G fi. We then set 
X* := X* + where 

t* = argmax {logdet(X* + tJ) - (S,X* + : t > -Amm(X*)}. 

It is not hard to see that t* can be easily found. We also observe that such X* belongs to iS^^, 
satisfies X*j = 0, ^{i-ij) G ^2 and retains the same sparsity as X*. In addition, by setting the 
log-likelihood value at X* to — oo if Amin(-^*) < 0, we can easily see that the log-likelihood 
value at X* is at least as good as that at X*. Thus, X* is a desirable estimation of sparse 
inverse covariance, provided X* is a good approximate solution to problem (jl]). 

In the remainder of this paper, we concentrate on finding an (eo, ec)-optimal solution of 
problem (jlj) for any pair of positive (eo, ec). We next present an algorithm framework for (jlj) 
based on an adaptive li penalty approach. 

Algorithm framework for general sparse inverse covariance selection (GSICS): 

Let eo > 0, ec > and > 1 be given. Let > 0,V(i,j) G i7 be given such that 
P% = P%Mhj) e n. Set p,, = pO. for all (z, j) G n. 

1) Find an eo-optimal solution X*^" of problem ([5]). 

2) If max \X^°\ < ec, terminate. Otherwise, set pij ^ Pij^p for all (i,j) G Q, and go to 
step 1). 

end 

Remark. To make the above framework complete, we need to choose suitable methods 
for solving problem ([5]) in step 1). We will propose first-order methods for it in Subsection 
12.21 In step 2) of the framework GSICS, there are some other strategies for updating the 
penalty parameters {pij}{i,j)en- For example, for any (z,j) G Q, one can update pij only if 
Pij > ec- But we observed in our experimentation that this strategy performs worse than the 
one described above. In addition, instead of using a common ratio Vp for all G Q, one 
can associate with each pij an individual ratio Vij. Also, the ratio Vp is no need to be fixed for 
all iterations, and it can vary from iteration to iteration depending on the amount of violation 
incurred in max \Xf°\ < ec- ■ 

Before discussing the convergence of the framework GSICS, we first study the convergence 
of the li penalty method for a general nonlinear programming (NLP) problem. 

Given a set 7^ A' C 3fJ" and functions f : X ^ ^, g : X and h : X ^ consider 

the NLP problem: 

/* = sup fix) 

x&x (8) 
s.t. g{x) = 0, h{x) < 0. 
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We associate with the NLP problem ([8]) the following li penalty function: 

P(x;A,^) := f{x)-X^\g{x)\-fi^h+{x), (9) 

where A G 3?^, /i G 3?^ and = max{0, hi{x)} for i = 1, . . . , /. 

We now establish a convergence result for the li penalty method for the NLP problem ([8]) 
under some assumption on f{x). 

Proposition 2.3 Let eo > and > be given. Assume that there exists some / G 3? such 
that f{x) < f for all x E X . Let E X be an eo-optimal solution of the problem 

sup {P(x; A, /i) : X e X} (10) 

for A G 9?^ and fi G and let Vx,^ '■= min{min Aj, min/ij}. Then fix'^xu) — f* ~ ^o, and, 

moreover, || (5^(2;^°^); /'-'''(a;^"^)) ||^ < Cc holds whenever vx,^ > (f — f* + ^o)/^c, where f* is the 
optimal value of the NLP problem 

Proof. In view of the assumption that f{x) is bounded above in A", we clearly see that /* 
is finite. Let /^^ denote the optimal value of problem ( fTOl) . We easily observe that /J^^ > /*. 
Using this relation, ([9]) and the fact that is an eo-optimal solution of (ITOl) . we have 

fixX,) > Pixl-Kf^) > fl,-eo > r-e„, (11) 

and hence the first statement holds. We now prove the second statement. Using ([9]), ( ITTi) and 
the definition of f a.^, we have 

- ^A,, II {9{xX,); h^ixt,)) IL ^ - ^A,, II {g{xX^); h^xl^j) \\^ 

> Pixl-\,fx) > f*-to. (12) 

Further, from the assumption, we know fi^x"^^) < f due to G X. This together with (fT2l) 
immediately implies that the second statement holds. ■ 

We are now ready to establish a convergence result for the framework GSICS. 

Theorem 2.4 Let > and ec > be given. Suppose that in step 1) of the framework 
GSICS, an eo-optimal solution X*^" of problem ^ is obtained by some method. Then, the 
framework GSICS generates an {eo,ec)- optimal solution to problem ^ in a finite number 
of outer iterations, or equivalently, a finite number of updates on the penalty parameters 

Proof. Invoking that S + Diag(p) >- (see Assumption [1]), we see that for any X G 5", 

logdetX - - Yl Pij\Xij\ < logdetX- (S + Diag(p),X) 

< sup{logdetF- (S + Diag(p),F) : Y y 0} < 00, 

where the last inequality follows from the fact that the above maximization problem achieves 
its optimal value at Y = (S + Diag(p))~^ >- 0. This observation together with Proposition 12.31 
immediately yields the conclusion. ■ 
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2.2 Adaptive first-order methods for problem ([5]) 

In this subsection, we will discuss some suitable first-order methods for solving problem ([5]) 
that appears in step 1) of the algorithm framework GSICS. 

As seen from Proposition 12.21 problem ([5]) has a unique optimal solution. We next provide 
some bounds on it. 

Proposition 2.5 Let fp{-) and X* denote the objective function and the unique optimal so- 
lution of problem respectively. Let be defined as 

^ := max + Diag(p))-i), 6} - {n - 1)[-1 - log A^i„(S + Diag(p))], (13) 

where 9 := n(-l - logTr(S + p) + logn). Then Upl ^ X* ^ (3pl, where ap := 1/(||S|| + ||p||) 
and (3p is the largest positive root of the following equation 

\ogt- A^in(S + Diag(p))t -d = Q. 

Proof. Let 

W :={[/€ 5": |f/y| < l,Vij}, (14) 

and 

0(X,f/) :=logdetX- (S + p*f/,X), V(X, f/) G 5^+ x W. (15) 

Since X* G is the optimal solution of problem ([5]), it can be easily shown that there exists 
some U* eU such that (X*, U*) is a saddle point of 0(-, ■) in x U, and hence 

X* = arg min (f)(X,U*). 
This relation along with 0151) immediately yields X*(E + p * U*) = L Hence, we have 



x; = {j: + p*u*)- y 



+ \\p*U* 



which together with (IMI) and the fact that U* G U, implies that X* y ]jsj|TjM'^' Thus, 
^ Cipl as desired. 

We next bound X* from above. Let /* denote the optimal value of problem ([5]). In view 
of the definition of /p(-) and ([6]), we have 

/* > max/p(t/) = max nlogt — tTr(S + p) = r2(— 1 — logTr(S + p) + logn) =: 9. 

Thus, /* > max{/p((S + Diag(p))^^), 9}. Using this result and following a similar procedure 
as for deriving ([7]), we can show that 

logA^ax(X;)-A^in(S + Diag(p))A^ax(X;) > ^, 
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where is given in (fT3!l . and hence the statement X* ^ (3pl immediately follows. ■ 
In view of Proposition I2.5[ we see that problem ([5]) is equivalent to the following problem 

logdetX- {^,X)-J2pij\Xijl (16) 



max 

where ctp and f]p are defined in Proposition 12. 51 

We further observe that problem ( |T6i) can be rewritten as 

max := min (17) 

where lA and 0(-, ■) are given in f|T4l) and f|T5l) . respectively, and A], is defined as follows: 

Xp := {X G 5" : a^J ^ X ^ (18) 

Observing that 0(X, U) : A'p x W ^ 3? is a smooth function which is strictly concave in 
X & Xp for every fixed U ElA, and convex ixiU ElA for every fixed X G A'p, we can conclude 
that (i) problem (fT71) and its dual, that is, 

min{^?,(f/):=max0(X,f/)} (19) 

are both solvable and have the same optimal value; and (ii) the function gp{-) is convex 
differentiable and its gradient is given by 

V^p(f/) = Vc7</)(X(f/),f/), V?7gW, 

where 

X{U) := arg max (j){X, U). (20) 
xaXp 

The following result shows that the approximate solution of problem ( |T7|) (or equivalently, 
([5])) can be obtained by solving smooth convex problem (fT9|) . 



Proposition 2.6 Let X* he the unique optimal solution of problem and let f* be the 

optimal value of problems (11) and / fiPj) . Suppose that the sequence {f/fc}^Q is such that 
QpiUk) ^ fp as k ^ oo. Then, X{Uk) — > X* and QpiUk) — fp{X{Uk)) as k ^ oo, where 
X(-) is defined in (2C 



Proof. The proof is similar to that of Theorem 2.4 of Lu [8J 



From Proposition 12. 6[ we see that problem ([5]) can be solved simultaneously while solving 
problem f|T9|) . Indeed, suppose that {f/fc}^Q C W is a sequence of approximate solutions 
generated by some method for solving f|T9|) . It follows from Proposition 12.61 that given any 
> 0, there exists some iterate such that QpiUk) — fp{X{Uk)) < ^o- Then, it is clear that 
X{Uk) is an eo-optimal solution of ffTTl) and hence (jS]). We next discuss two first order methods, 
namely, the adaptive spectral projected gradient method and the adaptive Nesterov's smooth 
method for problems (fT9|) and (fT7|) (or equivalently, ([5])). 
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2.2.1 Adaptive spectral gradient projection method 

In this subsection, we propose an adaptive spectral projected gradient (ASPG) method for 
solving problems (fT9l) and ( fTTl) (or equivalently, ([5])). 

The spectral gradient projection (SPG) methods were developed by Birgin et al. [3] for 
minimizing a smooth function over a closed convex set, which well integrate the nonmonotone 
line search technique proposed by Grippo et al. ^ and Barzilai-Borwein's gradient method [Ij 
into classical projected gradient methods (see [2]). We next discuss the one of them (namely, 
the SPG2 method [3]) for solving the problem 



and its dual 

for some /3 > Op, where 



min {gp,p{U) : U eU}, (21) 
max {/p(X) : apl ^ X ^ (31} (22) 



g,AU)-= max </.(X,f/), (23) 



U, 4>{-,-), fp{-) and ap are defined in (|T^ . f|T5|) . f|T7|) and Proposition 12. 5[ respectively. We 
denote by X^{U) the unique optimal solution of problem fl23l) . In view of f|T5|) . it is not hard 
to observe that Qp^U) is differentiable, and, moreover, Xp(U) and Vgp^U) have closed-form 
expressions for any U eU (see (30) of [8]). In addition, since W is a simple set, the projection 
of a point to U can be cheaply carried out. Thus, the SPG method [3J is suitable for solving 
problem (12T|) . 

For ease of subsequent presentation, we now describe the SPG method [3] for (|2T|) in 
details. The following notation will be used throughout this subsection. 
Given a sequence {t/fe}^o — ^ integer M > 1, we define 

r^f := max {gpAUk-j) : < j < mm{k, M - 1}}. 

Also, let Pu : 3?"^" ^ W be defined as 

Pu{U) := argmin{||f/ - U\\f : t/ G W}, Vt/ G 3?"^". 
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The SPG method for problems 1^ and IHT} : 

Let Co > 0, 7 G (0, 1), < ai < cr2 < 1 and < < c^max < cx3 be given. Let M > 1 be an 
integer. Choose Uq eU, E [«min, «max] and set /c = 0. 

1) If gp,i3{Uk) - fp{Xp{Uk)) < eo, terminate. 

2) Compute 4 = PuiUk - Ok'^gpAUk)) -Uk- Set A <— 1. 
2a) Set U+ = Uk + Xdk- 

2b) If gp,piU+) < + 'y\{dk,VgpAUk)), set Uk+i = U+, Sk = Uk+i - Uk, Vk = 
V(y'p,/3(f/fc+i) — S/gp^Uk)- Otherwise, choose A+ G [ai\,a2\], set A ^ A+ and go to 
step 2a). 

2c) Compute hk = {sk,yk)- If bk < 0, set a^+i = ctmax- Otherwise, compute = 
{sk,Sk) and set ak+i = min {a^i,^,max{a^in, ak/bk}}. 

3) Set k k + 1, and go to step 1). 
end 

We next establish a convergence result for the SPG method for solving problems fl2Tl) and 
([22]). 

Theorem 2.7 Let eo > be given. The SPG method generates a pair of eo- optimal solutions 
{Uk, Xf^{Uk)) to problems l[21\) and [2^) in a finite number of iterations. 

Proof. Suppose by contradiction that the SPG method does not terminate. Then it 
generates a sequence {Uk}'k=Q ^ U satisfying gp^Uk) — fp{Xp{Uk)) > eo. Note that gp,f3{-) 
is convex, which together with Theorem 2.4 of [3J implies that any accumulation point of 
{t/fc}fclo optimal solution of problem fl2T|) . By the continuity of gp^-), it further implies 
that any accumulation point of {5'p,/3(f/fc)}^Q is the optimal value /* of (!2T!) . Using this 
observation and the fact that {gp,i3{Uk)}'k=Q is bounded, we conclude that gp^Uk) fp as 
k CO. Further, in view of Proposition 12.61 by replacing Pp with j3, and gp{-) with gp^-), 
we have gp^/3{Uk) — fp{Xf^{Uk)) —>■ as k ^ oo, and arrive at a contradiction. Therefore, the 
conclusion of this theorem holds. ■ 

Based on the above discussion, we see that the SPG method can be directly applied to 
find a pair of eo-optimal solutions to problems ( |T9|) and ( |T7I) (or equivalently, ([5])) by setting 
(3 = f3p, where f3p is given in Proposition 12.51 It may converge, however, very slowly when 
Pp is large. Indeed, similarly as in [8], one can show that Vgp^U) is Lipschitz continuous 
on U with constant L = /^^(maxpjj)^ with respect to the Frobenius norm. Let a^, bk and dk 

id 

be defined as above. Since gp^-) is convex, we have &fc > 0. Actually, we observed that it 
is almost always positive. In addition, amin and Omax are usually set to be 10~^° and 10^°, 
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respectively. Thus for the SPG method, we typically have 

\\Uk+i-Uk\\l ^ 1 



ttfc+i 



> 



{Uk+i -Uk,Vgp,i3{Uk+i) -Vgp,f3{Uk)) L (3'^{maxpij] 



Recall that Pp is an upper bound of Xmaxi^p), and typically it is overly large, where X* is the 
optimal solution of When j3 = jSp, we see from above that can be very small and so is 
f/fc+i - Uk due to 

||f^fc+i - f^fcllF < ||4||f = \\PuiUk- akVgp^p{Uk)) -UkWr < ak\\Vgp^p{Uk)\\F- 

Therefore, the SPG method may converge very slowly when applied to problem (|T9l) directly. 

To alleviate the aforementioned computational difficulty, we next propose an adaptive SPG 
(ASPG) method for problems ( |T9l) and ( ITTl) (or equivalently, ([5])) by solving a sequence of 
problems ( 1211) with /3 = /3o, for some {Pk}T=o approaching Amax(-^p) monotonically 

from below. 

The adaptive SPG (ASPG) method for problems ([HD and (HM : 

Let Co > 0, /So <^ Pp and > 1 be given. Choose Uq ElA and set = 0. 

1) Set P <— Pk- Apply the SPG method to find a pair of eo-optimal solutions {Uk, X[^{Uk)) 
to problems ( l2Ti) and ( l22l) starting from Uq. 

2) li P = Pp or Xrai,^{Xf3{Uk)) < P, terminate. 

3) Set Uq ^ Uk, Pk+i = niin{/?r/3, k k + 1, and go to step 1). 
end 

We now establish a convergence result for the ASPG method for solving problems (|T9l) 
and f[T7|) (or equivalently, (^). 

Theorem 2.8 Let to > be given. The ASPG method generates a pair of eo-optimal solutions 
to problems ( fi^) and p7| j (or equivalently, ^) in a finite number of total (inner) iterations. 

Proof. First, we clearly see that P is updated for only a finite number of times. Using 
this observation and Theorem 12. 7^ we conclude that the ASPG method terminates in a finite 
number of total (inner) iterations. Now, suppose that it terminates a.t P = Pk for some k. 
We claim that {Uk, Xp{Uk)) is a pair of Co-optimal solutions to problems (fT9l) and (fT7|) (or 
equivalently, ([5])). Indeed, we clearly have P = Pp or Xmax{X ^{Ukj) < P, which together with 
the definition of gp{-) and gp,p{-) (see ( IT^ and ([21])), implies that gp{Uk) = gp,p{Uk)- Thus, 
we obtain that 

gp{Uk) - fp{Xp{Uk)) = gpAUk) - fp{Xp{Uk)) < eo, 

which along with the fact Xp{JJk) G PCp, implies that {Uk, Xj3{Uk)) is a pair of eo-optimal 
solutions to problems (fT9l) and ([1] 
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As discussed above, the ASPG method is able to find a pair of eo-optimal solutions to 
problems ([5]) and ( fT9l) . We now show how this method can be extended to find an (eo, Cc)- 
optimal solution to problem (j4j). Recall from the framework GSICS (see Subsection 12.11) 
that in order to obtain an (eo, ec)-optimal solution to problem Q, we need to find an eo- 
optimal solution of problem ([5]) for a sequence of penalty parameters {p''}^^^, which satisfy 
for k = 1, . . . ,m, pfj = pij, ^ ^ and p^^ = p^if^p'^, V(z,j) G VL for some Vp > 1 and 

p°- > 0, G VL. Suppose that a pair of eo-optimal solutions {Xj3^{Uk)),Uk) of problems 

(El) and ( fT9l) with p = p^ are already found by the ASPG method for some 13k G [apk,(3pk\. 
Then, we choose the initial Uq and /5o for the ASPG method when applied to solve problems 
([5]) and ([I9l) with p = p^^'^ as follows: 

= { ll!f " otierwil"' • * = A...„(X,JC/.))} . (24) 

We next provide some interpretation on such a choice of Uq and (3q. Since tj^ ElA and > 1, 
we easily see that Uq G U. In addition, using the definition of f3p (see Proposition I2.5P and the 
fact that Diag(p'^"'""^) = Diag(p'^), we observe that [3pk+i = Ppk, and hence /3o G [apk+i, Ppk+i]. 
Let /* denote the optimal value of problem ([5]) for any given p. Clearly, we can observe from 

the ASPG method that either Amax(-^/3fc(?^fc)) < Pk < Pp^ or Amax(-^&(f/fe) < Pk = Ppk holds, 
which together with (|T9l) and fl23|) implies that 

VA(^fc) = V(f^'t) e [/>,/> + eo]. (25) 

Typically, Xmay,{Xf^i^{Uk) > a^fc+i, and hence /3o = Amax(-^/3fe(f/fc)) < Pk generally holds. Also 
usually, ttpk+i ^ apk ^ 0. Using these relations along with fl25l) . f|T9l) and fl23|l . we further 
observe that 

f*pk+i < ^pfc+i(t/o) ~ V(f^fc) ^ /;fc+eo, 

fpk+i < gpk+i^fSg{Uo) ^ V,/3o(f^o) < gpi^AiUk) < fpk + eo. 

It follows that when f*k+i is close to t/o is nearly an eg-optimal solution for problems f|T9l) 
and ( l23i) with p = p^^^ and P = Po- Therefore, we expect that for the above choice of Uq and 
Po, the ASPG method can solve problems (I5l) and ( |T9l) with p = p'^^^ rapidly when p'^^^ is 
close to p*^. 

2.2.2 Adaptive Nesterov's smooth method 

In this subsection, we propose an adaptive Nesterov's smooth (ANS) method for solving 
problems ( fT9l) and ( |T71) (or equivalently, ([5])). 

Recently, Lu [8J studied Nesterov's smooth method P, for solving a special class of 
problems (fT9|) and (fT7|) (or equivalently, ([5])), where p is a positive multiple of ee^. He 
showed that an eo-optimal solution to problems (fT9l) and (fT7j) can be found in at most 
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\/2(3p{max pij) max \\U — Uo\\f/\/^ iterations by Nesterov's smooth method for some initial 

point Uq E U (see pp. 12 of [8] for details). Given that Pp is an estimate and typically 
an overestimate of Xma.x{Xp), where X* is the unique optimal solution of problem ([5]), the 
aforementioned iteration complexity can be exceedingly large and Nesterov's smooth method 
generally converges extremely slowly. Lu [5] further proposed an adaptive Nesterov's smooth 
(ANS) method for solving problems (flQll and (fT7j) (see pp. 15 of [8j). In his method, Amax(-^p) 
is estimated by Amax(^(^fc)) and adaptively adjusted based on the change of Aniax(-^(f^fc)) as 
the algorithm progresses, where Uk is an approximate solution of problem (fT9l) . As a result, his 
method can provide an asymptotically tight estimate of Xmaxi^p) and it has an asymptotically 
optimal iteration complexity. 

We now extend the ANS method [Sj to problems f|T9l) and f|T7|) (or equivalently, ([5])) with 
a general p. Recall from Subsection 12.2.11 that VgpiJJ) is Lipschitz continuous on U with 
constant L = /5^(maxpjj)^ with respect to the Frobenius norm. Then it is straightforward 

to extend the ANS method [8j to problems (IT^ and for a general p by replacing the 
corresponding Lipschitz constants by the ones computed according to the above formula. For 
ease of reference, we provide the details of the ANS method for problems f|T9|) and f|T7j) (or 
equivalently, ([5])) below. 

Throughout the remainder of this section, we assume that ap, Pp, gp,p{-) and Xp{-) are 
given in Proposition 12.51 and Subsection 12.2. 1^ respectively. We now introduce a definition 
that will be used subsequently. 

Definition 2 Given any U eU and (3 G [op, (3p], Xp{U) is called "active" if \ma.y.{X (3{U)) = (3 
and (3 < (3p; otherwise it is called "inactive" . 

We are now ready to present the ANS method [8] for problems (fT9|) and (ITTl) . 
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The ANS method for problems ([Tf]) and ([H]) 

Let e > 0, ^1, <,2 > 1, and let ^3 G (0, 1) be given. Let pmax = niaxpjj. Choose Uq E U and 
(3 G [ap,/5p]. Set L = /?^Pmax) ^ = k = 0. 

1) Compute XfiiUk)- 

la) If Xp{Uk) is active, find the smallest s G 2^+ such that Xp{Uk) is inactive, where 
/? = min{?^/3, /?p}. Set k = 0, Uq = Uk, P = P, L = /^^p^^x ^i^d go to step 2). 

lb) If X^(f/fe) is inactive and Amax(-^/3(f/fc)) < ^"3/5, set = 0, [/q = t/fc, 
/3 = max{min{^2Amax(-^/3(f/fc)),/5p},ap}, and L = /^VLx- 

2) If Qp^piUk) - fp{Xp{Uk)) < e, terminate. Otherwise, compute Vgpfi{Uk)- 

3) Find t/f = argmin {(V^p,;3(t/fc), U-Uk) + ^ \\U - Uk\\l : U eU). 

4) Find Ul' = argmin l^llf/ - U4l + ^^[gpAUi) + (V^7p,^(f/0, f/ - t/,)] : f/ G w|. 

5) Set = 43f/r + gif/,^'^. 

6) Set /c ^ /c + 1, and go to step 1). 
end 

Similarly as the ASPG method, we can easily extend the ANS method to find an (co, Cc)- 
optimal solution to problem (jl]) by applying the same strategy for updating the initial Uq 
and /3o detailed at the end of Subsection 12.2. 1[ For convenience of presentation, the resulting 
method is referred to as the adaptive Nesterov's smooth (ANS) method. 



3 Computational results 

In this section, we test the sparse recovery ability of the model (jlj) and compare the per- 
formance of the adaptive spectral projected gradient (ASPG) method and the adaptive Nes- 
terov's smooth (ANS) method that are proposed in Section [2] for solving problem (jlj) on a set 
of randomly generated instances. 

All instances used in this section were randomly generated in a similar manner as described 
in dAspremont et al. [5] and Lu [8]. Indeed, we first generate a sparse matrix A G 5"+, and 
then we generate a matrix B G S'"' by 

B = A-^ + tV, 

where V G 5" contains pseudo-random values drawn from a uniform distribution on the 
interval [—1,1], and r is a small positive number. Finally, we obtain the following randomly 
generated sample covariance matrix: 

S = 5-min{A^in(5)-^9,0}/, 
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Table 1: Comparison of ASPG and ANS for g = 0.1 
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576896 
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6601.2 


6500.5 
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1660 
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1660 


1154 


12975.7 
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2600 


1285 


2600 


1903 


27523.2 


20059.9 



where "i? is a small positive number. In particular, we set r = 0.15, -i? = l.Oe — 4 for generating 
all instances. 

In the first experiment we compare the performance of the ASPG and ANS methods for 
problem (jll). For this purpose, we first randomly generate the above matrix A G S^_^_ with 
a density prescribed by g, and set fl = ■ Aij = 0,\i — j\ > 2} and pij = 0.5 for all 

^ S is then generated by the above approach. The codes for both methods are 
written in MATLAB. In particular, we set 7 = 10-^ M = 8, di = 0.1, (Xs = 0.9, Omin = 10-^^ 
Oimax = 10^^ for the ASPG method, and set ?i = <J2 = 1-05 and <^3 = 0.95 for the ANS method. 
In addition, for both methods we set Pq = 1, = 10, = 2, and = 0.5 for all G fl. 
Also, the ASPG and ANS methods start from the initial point Uq = and terminate once 
an (eo, ec)-optimal solution of problem (jlj) is found, where = 0.1 and ec = 10""^. All 
computations are performed on an Intel Xeon 2.66 GHz machine with Red Hat Linux version 
8. 

The performance of the ASPG and ANS methods for the randomly generated instances 
with density g = 0.1, 0.5 and 0.9 is presented in Tables [T]l3l respectively. The row size n of 
each sample covariance matrix S is given in column one. The size of the set Q is given in 
column two. The numbers of (inner) iterations of ASPG and ANS are given in columns three 
to four, the number of function evaluations are given in columns five to six, and the CPU times 
(in seconds) are given in the last two columns, respectively. From Tables WM, we see that 
both methods are able to solve all instances within a reasonable amount of time. In addition, 
the ASPG method, namely, the adaptive spectral gradient method, generally outperforms the 
ANS method, that is, the adaptive Nesterov's smooth method. 

Our second experiment is similar to the one carried out in d'Aspremont et al. [5]. We 
intend to test the sparse recovery ability of the model (jl]). To this aim, we speciahze n = 30 
and the matrix A e S'!l_^ to be the one with diagonal entries around one and a few randomly 
chosen, nonzero off-diagonal entries equal to +1 or —1 and the sample covariance matrix S is 
then generated by the aforementioned approach. Also, we set Q = : Aij = 0, > 5} 
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Table 2: Comparison of ASPG and ANS for q = 0.5 
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Table 3: Comparison of ASPG and ANS for g = 0.9 
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and Pij = 0.1 for all ^ ^- The model (jlj) with such an instance is finally solved by the 
ASPG method whose parameters, initial point and termination criterion are exactly same as 
above. In Figure [H we plot the sparsity patterns of the original inverse covariance matrix A, 
the approximate solution to problem (jlj) and the noisy inverse covariance matrix for such 
a randomly generated instance. We observe that the model (jl]) is capable of recovering the 
sparsity pattern of the original inverse covariance matrix. 

4 Concluding remarks 

In this paper, we considered estimating sparse inverse covariance of a Gaussian graphical model 
whose conditional independence is assumed to be partially known. Naturally, we formulated 
it as a constrained /i-norm penalized maximum likelihood estimation problem. Further, we 
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Original inverse A Approximate solution of (4) Noisy inverse 

Figure 1: Sparsity recovery. 

proposed an algorithm framework, and developed two first-order methods, that is, adaptive 
spectral projected gradient (ASPG) method and adaptive Nesterov's smooth (ANS) method, 
for solving it. Our computational results demonstrate that both methods are able to solve 
problems of size at least a thousand and number of constraints of nearly a half million within 
a reasonable amount of time, and the ASPG method generally outperforms the ANS method. 

The source codes for the ASPG and ANS methods (written in MATLAB) are available 
online at www.math.sfu.ca/~zhaosong. They can also be applied to problem (jl]) with = 0, 
namely, the case where the underlying sparsity structure is completely unknown. It shall be 
mentioned that these codes can be extended straightforwardly to more general problems of 
the form 

max logdetX — (S,X) — ^ pij\Xij\ 
s.t. al ^ (31, 

X,, = 0, 

where < a < /5 < oo are some fixed bounds on the eigenvalues of the solution. 
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